This dataset consists of all the tracks that currently exist in my Spotify library. In other words, all of the tracks (or groups of them) that I’ve liked, telling Spotify that I’m into that song. Each row consists of a single track, i.e. one registered liking event. The goal of this analysis is to explore how my habit of liking music on this specific platform has changed. Even though I, being the one feeding the data, have an overall idea of how things changed, I have a feeling that some surprises might arise. Or not. But I’m really into music; to say that I listen daily is far from being an overstatement. Spotify has been my main gateway to music listening, so as long as I remembered to like the tracks that I indeed liked, this is a reliable data source.

Missing values

Something that helps if done before the analysis is spotting and correcting for missing values. So we’ll start by omitting them and then checking that none remain.

## [1] FALSE

Are there any missing values? FALSE
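The omit-then-check step above can be sketched in base R as follows. This is a minimal illustration on a made-up data frame, not the code that produced the output above; the object names `tracks` and `tracks_clean` are assumptions.

```r
# Toy data frame standing in for the real library export (values are made up).
tracks <- data.frame(
  name         = c("a", "b", "c", "d"),
  danceability = c(0.71, NA, 0.64, 0.80),
  energy       = c(0.55, 0.62, 0.90, 0.47)
)

# Count missing values per column before dropping anything.
na_per_column <- colSums(is.na(tracks))

# Keep only the rows that have no missing value in any column.
tracks_clean <- na.omit(tracks)

# Check whether any missing value is left after the omission.
has_missing <- anyNA(tracks_clean)
print(has_missing)
```

Looking at `na_per_column` first shows how many rows are about to be lost and in which column, which is worth knowing before calling `na.omit()`.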

Univariate analysis

Let’s take a look at all the variables and define for which ones

  1. we are scratching the surface and
  2. we are digging deeper
##  [1] "X"                "added_at"         "artist_id"       
##  [4] "artist_name"      "duration_ms"      "explicit"        
##  [7] "id"               "name"             "popularity"      
## [10] "acousticness"     "danceability"     "duration_ms.1"   
## [13] "energy"           "id.1"             "instrumentalness"
## [16] "key"              "liveness"         "loudness"        
## [19] "mode"             "speechiness"      "tempo"           
## [22] "time_signature"   "valence"

It’d be interesting to analyze:

  1. How artists rank by number of saved tracks
  2. The mean duration of tracks
  3. Median values on track features: [key, mode, danceability, liveness, speechiness, valence, loudness, tempo, instrumentalness, popularity]
  4. Time-series of the liking event itself

Artists

The question is simple: how do the artists in my library rank?

That’s interesting. I couldn’t remember adding “Daniel Grau” songs, but someone else analyzing this data would assume I’m a huge fan (which I’m really not). J Cole, on the other hand, really is an artist that I

We can definitely observe that the majority of artists have only one track saved. That made me wonder how high this ratio is.

It is indeed really large: 99% of the artists in my library have at most 2 tracks there.
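A ratio like this can be computed by counting tracks per artist and then asking what share of artists fall at or below a threshold. The sketch below uses a made-up `artist_name` vector; only the column name comes from the dataset, everything else is an assumption.

```r
# Hypothetical artist_name values; the real column comes from the dataset.
artist_name <- c("A", "A", "A", "B", "C", "C", "D", "E")

# Number of saved tracks per artist, sorted from most to least.
tracks_per_artist <- sort(table(artist_name), decreasing = TRUE)

# Share of artists with at most 2 saved tracks.
share_at_most_two <- mean(as.vector(tracks_per_artist) <= 2)
print(share_at_most_two)
```

Taking the mean of a logical vector gives the proportion of `TRUE` values, so no explicit division is needed.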

Time dimension

It’s important to know the span of the data. Which was the first song and which was the latest?

First we need to convert from factor to a date format.

## [1] 3.232877

That’s approximately 3 years and 3 months’ worth of music saving (0.23 of a year is about 85 days).
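As a rough sketch of the conversion and the span calculation, assuming `added_at` holds ISO 8601 timestamps stored as a factor and that the span in years is the number of days divided by 365 (both are assumptions, and the timestamps below are made up):

```r
# Made-up timestamps in an assumed ISO 8601 format, stored as a factor.
added_at <- factor(c("2015-06-01T12:00:00Z",
                     "2017-07-15T08:30:00Z",
                     "2018-08-25T20:10:00Z"))

# Convert factor -> character first, then keep the date part (YYYY-MM-DD).
added_date <- as.Date(substr(as.character(added_at), 1, 10))

# Span between the first and the latest saved track.
span_days  <- as.numeric(difftime(max(added_date), min(added_date), units = "days"))
span_years <- span_days / 365
print(span_years)
```

Going through `as.character()` matters: calling a numeric conversion directly on a factor returns its internal level codes rather than the text it displays.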

Track Features

This part really interests me and, since there are a fair number of features, I’m condensing the analysis. I’m focusing on features that might hold interesting information: danceability, energy, instrumentalness, speechiness, popularity and acousticness. This information can tell a lot about a given track, let alone a library of them.

It’s important to point out that most of the features with a [0, 1] range express a magnitude. For example, if a song has an instrumentalness of 0.95, it is about as purely instrumental as it can be. On the other hand, if a song is graded 0.5 on energy, it’s a song of average energy. Most of those features are extracted by extensive waveform analysis.

Summary

##   danceability        energy        speechiness      instrumentalness
##  Min.   :0.1030   Min.   :0.0332   Min.   :0.02500   Min.   :0.0000  
##  1st Qu.:0.6780   1st Qu.:0.5200   1st Qu.:0.04400   1st Qu.:0.2465  
##  Median :0.7570   Median :0.6620   Median :0.05450   Median :0.8300  
##  Mean   :0.7263   Mean   :0.6441   Mean   :0.08588   Mean   :0.6290  
##  3rd Qu.:0.8020   3rd Qu.:0.7850   3rd Qu.:0.07845   3rd Qu.:0.8950  
##  Max.   :0.9830   Max.   :0.9910   Max.   :0.68100   Max.   :0.9720  
##    popularity     acousticness      
##  Min.   : 0.00   Min.   :0.0000055  
##  1st Qu.: 6.00   1st Qu.:0.0017400  
##  Median :19.00   Median :0.0159000  
##  Mean   :21.13   Mean   :0.1360420  
##  3rd Qu.:33.00   3rd Qu.:0.1455000  
##  Max.   :78.00   Max.   :0.9700000

Danceability

This distribution is of great interest. It’s no surprise that the plot is left-skewed: most of the songs lie in the [0.7, 0.8] range, which shows how much dance music I’ve saved.

Energy

Perhaps energy shares some latent information with danceability? Even though this distribution is not really left-skewed, most of the songs are above the 0.5 mark.

Speechiness

It’s clear that the songs I’ve saved don’t have many vocals. The distribution is highly right-skewed, which shows a clear pattern in the dataset.

Instrumentalness

It’s also no news that most songs have high instrumentalness. However, the relatively high count of 0 values is surprising, given what I know about my taste and appreciation for instrumental songs.

Acousticness

Perhaps acousticness shares latent information with speechiness. The plot is clearly right-skewed, with most values in the [0, 0.2] range.

Popularity

These plots are enlightening. They clearly show a pattern: I tend to like tracks that are danceable, energetic and mostly instrumental, with very few vocals. This was somewhat expected, given my taste in electronic music. But I also tend to listen to rap and hip hop, which have vocals, and that is not reflected in my saved tracks. Perhaps I don’t like it that much?

Keys

Looks like, relative to the average count per key, we have many tracks in D minor and G, and very few tracks in E minor.
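The dataset stores `key` as an integer pitch class (0 = C, 1 = C♯/D♭, and so on up to 11 = B) and `mode` as 1 for major and 0 for minor, so readable labels such as "D minor" have to be built from the two columns. A minimal sketch with made-up values, assuming no track carries the "no key detected" value of -1:

```r
# Pitch-class names indexed 0..11 (R vectors are 1-based, hence key + 1 below).
pitch_classes <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                   "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")

# Made-up key/mode pairs; the toy data contains no -1 (undetected) keys.
key  <- c(2, 7, 4, 2)
mode <- c(0, 1, 0, 0)

# Build labels such as "D minor" or "G major".
key_label <- paste(pitch_classes[key + 1], ifelse(mode == 1, "major", "minor"))

# Count tracks per key label, most frequent first.
key_counts <- sort(table(key_label), decreasing = TRUE)
print(key_counts)
```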

What is the structure of your dataset?

The chosen dataset is a relatively small one, with 923 records, but also relatively dense, with 23 features. Each row consists of one event of saving a track to my Spotify library. Most of the features have values ranging from 0 to 1, expressing the magnitude of that feature; e.g. instrumentalness indicates, from 0 to 1, how instrumental a track is.

What is/are the main feature(s) of interest in your dataset?

The features I’m most interested in have had their distributions plotted above. They are: danceability, energy, instrumentalness, speechiness, popularity and acousticness. Even though there are many features, only a few might hold information on how my taste has evolved, and those are the ones I selected.

Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

For this first brief analysis I only had to take out a single missing value. Besides that, variables were analyzed on their own, so no feature engineering was done.

Bivariate Plots Section

Now we’re ready to explore how the features relate to each other. For now, we’ll limit ourselves to pairwise (bivariate) analysis. The first thing that comes to mind is to explore the time dimension of this dataset.

Time added

Now we can analyze grouping by date:

There is a clear increase from the middle of 2017 on. If we grouped this data monthly, would we see a trend?

I was almost right. July 2017 was the month I added 172 songs. That’s nearly 19% of the entire dataset (172 out of 923). You can see that the second highest is July 2018. Maybe there’s a pattern there?

The graph confirms our assumption: for some reason, more songs are added in the second half of the year than in the first.
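The monthly and half-year groupings discussed above can be sketched with base R date formatting. The dates below are made up and the variable names are assumptions; only the idea of counting saves per period comes from the text.

```r
# Made-up save dates standing in for the converted added_at column.
added_date <- as.Date(c("2017-03-02", "2017-07-10", "2017-07-21",
                        "2018-07-05", "2018-02-11"))

# Tracks saved per calendar month, and each month's share of the total.
month_label      <- format(added_date, "%Y-%m")
tracks_per_month <- table(month_label)
share_per_month  <- prop.table(tracks_per_month)

# Tracks saved in the first versus second half of the year, across all years.
month_number    <- as.integer(format(added_date, "%m"))
half            <- ifelse(month_number <= 6, "first half", "second half")
tracks_per_half <- table(half)
print(tracks_per_half)
```

`prop.table()` gives the same kind of figure as the share quoted for July 2017: one month's count divided by the total number of saved tracks.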

Danceability and popularity

How much does danceability correlate with popularity?

Well, it seems not so much. Judging by the graph and the density of points, most songs that are danceable do not rank high in popularity.

Danceability and energy

It seems that tracks that are danceable also tend to have high energy.

Energy and loudness

There seems to be a correlation between a track’s energy and how loud it is. A little (too much) expected. But how correlated are they?

## 
##  Pearson's product-moment correlation
## 
## data:  energy and loudness
## t = 22.152, df = 921, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5458111 0.6301299
## sample estimates:
##       cor 
## 0.5895744

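And indeed it is a moderately strong correlation: Pearson’s r is 0.59. From this we can tell that both features share some latent information, although r² is about 0.35, so loudness linearly accounts for only around a third of the variance in energy.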
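A test like the one printed above comes from `cor.test()`. The sketch below runs it on simulated data, since the real columns are not reproduced here; the simulated `energy` and `loudness` values are assumptions and will not match the reported numbers. The last lines square the coefficient reported above to show how much variance a linear relationship of that strength explains.

```r
set.seed(42)

# Simulated stand-ins for the loudness (dB) and energy columns.
# The simulated energy is left unclamped, unlike the real [0, 1] feature.
loudness <- rnorm(200, mean = -8, sd = 3)
energy   <- 0.7 + 0.03 * loudness + rnorm(200, sd = 0.1)

# Pearson's product-moment correlation test.
result <- cor.test(energy, loudness, method = "pearson")
r      <- unname(result$estimate)

# Share of variance explained by a linear fit on the simulated data.
r_squared <- r^2

# The same calculation for the coefficient reported in the output above.
reported_r         <- 0.5895744
reported_r_squared <- reported_r^2
print(round(reported_r_squared, 3))
```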

Instrumentalness and music-key

Danceability has interesting relationships: roughly speaking, tracks that are less popular are more danceable, and tracks that have more energy are more danceable. It’s important to notice that this is a specific subset of a larger population of tracks, and it might be the case that I simply don’t listen to a lot of popular tracks, so this relationship holds only in reference to this set.

It seems that there might be clues on how instrumental a song is just by looking at its key. This might hint at what kind of instrumental tracks I roughly prefer: low-pitched (or bass-rich) tracks.

Multivariate Analysis

What influences danceability?

I suspect that a few features have latent relationships with danceability. And since understanding what lies behind this feature is of much interest, let’s plot it:

## <ScaleContinuous>
##  Range:  
##  Limits:    0 --    1

Track Profile

I noticed that a lot of features have the same range ([0.0, 1.0]). So what would an average track sampled from my library look like?

And it’s confirmed: I like music that is danceable, energetic and with lots of instrumentals. Also, these tracks tend to be low on acousticness and speechiness, and not popular.
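One way to build such an average profile is to take the column means of the [0, 1] features. The sketch below uses made-up values; rescaling `popularity` from its 0 to 100 range down to [0, 1] so it can sit beside the other features is an assumption, not necessarily what was done for the plot.

```r
# Made-up tracks with the features of interest.
tracks <- data.frame(
  danceability     = c(0.70, 0.80, 0.75),
  energy           = c(0.60, 0.70, 0.65),
  instrumentalness = c(0.90, 0.00, 0.60),
  speechiness      = c(0.05, 0.04, 0.06),
  acousticness     = c(0.01, 0.20, 0.03),
  popularity       = c(10, 30, 20)
)

# Features that already live on a [0, 1] scale.
unit_features <- c("danceability", "energy", "instrumentalness",
                   "speechiness", "acousticness")

# Average track: the mean of each feature across the library.
profile <- colMeans(tracks[, unit_features])

# Popularity runs from 0 to 100, so rescale it to [0, 1] (an assumption here).
profile["popularity"] <- mean(tracks$popularity) / 100
print(round(profile, 3))
```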

Overall time lapse

If we were to plot the monthly average of each feature, how would these features evolve (if they evolve at all) through time?

Well, the aforementioned evolution doesn’t exist. These time series are rather stochastic. But danceability and energy stay above a baseline, meaning that throughout this time my taste for energetic and danceable tunes remained much the same. It’s also interesting to note how my listening to acoustic songs varies over time.
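The monthly averages behind a plot like this can be computed with `aggregate()`. This is a minimal sketch on made-up rows with only two features; the `month` column and the toy values are assumptions.

```r
# Made-up rows: a save date plus two of the [0, 1] features.
tracks <- data.frame(
  added_date   = as.Date(c("2017-07-03", "2017-07-20", "2017-08-01", "2017-08-15")),
  danceability = c(0.60, 0.80, 0.70, 0.90),
  energy       = c(0.50, 0.70, 0.40, 0.60)
)

# Label each row with its calendar month.
tracks$month <- format(tracks$added_date, "%Y-%m")

# Mean of each feature per month.
monthly_means <- aggregate(cbind(danceability, energy) ~ month,
                           data = tracks, FUN = mean)
print(monthly_means)
```

The result has one row per month and one column per feature, which is also the shape a tile or heatmap plot needs once it is reshaped to long format.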

Timelapse on tiles

The scatterplot above was more confusing than elucidating. This heatmap, however, is great at displaying patterns. It makes crystal clear how each feature, on average, changed from month to month. The assumptions about danceability, energy and acousticness are clearly verifiable.


Final Plots and Summary

Danceability and Popularity

Description

The relationship between how danceable a track is and how popular it is strikes me. By adjusting the opacity of the points and adding density plots, we get a clear understanding of both distributions and of the relationship between them. It’s also worth noting that I am not a very mainstream listener myself; nevertheless, it’s clear that, in this library, popular songs don’t tend to be too danceable.

Loudness and Energy

Description

A previous assumption of mine was largely confirmed: the energy of a song is positively and roughly linearly related to how loud it is (Pearson’s r of about 0.59). This discovery is important to me because I’m an aspiring DJ, and unraveling these kinds of relationships is of great help in better understanding how to classify songs.

Feature Evolution

Description

This timeline tile-map shows that my taste didn’t change much over the years, which was a surprise to me. I had the belief that I was an eclectic and broad listener. But the data doesn’t lie and even though instrumentalness and valence clearly varied throughout the months,


Reflection

This analysis was of great benefit to me! I was able to get a more realistic view of what kind of music I’ve been enjoying for the past three-plus years and how that taste changed. Or not. As I mentioned earlier, this has extra importance for me since I’m now focusing on a more serious DJ enterprise; having this kind of crude analysis of songs (provided by Spotify) enhances how I understand music.

I wish I were more proficient in the R language in order to write and create more professionally. But I value these kinds of tasks that take me out of my comfort zone to learn new things, which was the idea of this whole course. I’ve had a few issues with data transformation, overall language syntax and writing style, but I think I managed to get it done.

I am also aware that many more analyses could have been done. There was the genre feature that I missed. I could have scraped it as well, but time was a constraint here.